Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training
Authors
Abstract
Medical vision-and-language pre-training provides a feasible solution to extract effective representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs that make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction, to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, where state-of-the-art results are achieved on all downstream tasks. Besides, we conduct further analysis to better verify the effectiveness of different components of our approach and of various settings of pre-training. The source code is available at https://github.com/zhjohnchan/M3AE .
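The asymmetric masking design (a much larger masking ratio for image patches than for text tokens) can be illustrated in a few lines. The sketch below is a minimal illustration and not the authors' released code: the function name `random_mask` and the 75%/15% ratios are assumptions chosen to reflect common practice in masked image/language modeling rather than the paper's exact configuration.

```python
# Minimal sketch of asymmetric random masking for a multi-modal masked autoencoder.
# Assumed names and ratios; not the authors' implementation.
import torch

IMG_MASK_RATIO = 0.75  # assumed: images are information-sparse, so mask aggressively
TXT_MASK_RATIO = 0.15  # assumed: text is information-dense, so mask conservatively

def random_mask(tokens: torch.Tensor, ratio: float):
    """Randomly remove a `ratio` fraction of tokens along the sequence dimension.

    tokens: (batch, seq_len, dim)
    returns: (kept_tokens, keep_indices, mask) where mask is True at removed
    positions, i.e. the positions the decoder would have to reconstruct.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * (1.0 - ratio)))
    scores = torch.rand(b, n, device=tokens.device)       # random priority per token
    keep_idx = scores.argsort(dim=1)[:, :n_keep]          # lowest-score tokens are kept
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, keep_idx, False)                     # True = masked / to reconstruct
    return kept, keep_idx, mask

# Toy usage: 196 image patches vs. 32 text tokens, 768-dim embeddings.
img_patches = torch.randn(2, 196, 768)
txt_tokens = torch.randn(2, 32, 768)
img_kept, _, img_mask = random_mask(img_patches, IMG_MASK_RATIO)
txt_kept, _, txt_mask = random_mask(txt_tokens, TXT_MASK_RATIO)
print(img_kept.shape, txt_kept.shape)  # far fewer visible image patches than text tokens
```

The kept tokens from both modalities would then be fed to a shared encoder, with separate decoders (a Transformer for the image branch and an MLP for the text branch, per the abstract) predicting the masked pixels and tokens.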
Similar Resources
Pre-Training CNNs Using Convolutional Autoencoders
Despite convolutional neural networks being the state of the art in almost all computer vision tasks, their training remains a difficult task. Unsupervised representation learning using a convolutional autoencoder can be used to initialize network weights and has been shown to improve test accuracy after training. We reproduce previous results using this approach and successfully apply it to th...
Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training
Deep neural networks are capable of modelling highly nonlinear functions by capturing different levels of abstraction of data hierarchically. While training deep networks, first the system is initialized near a good optimum by greedy layer-wise unsupervised pre-training. However, with burgeoning data and increasing dimensions of the architecture, the time complexity of this approach becomes eno...
Convergence of gradient based pre-training in Denoising autoencoders
The success of deep architectures is at least in part attributed to the layer-by-layer unsupervised pre-training that initializes the network. Various papers have reported extensive empirical analysis focusing on the design and implementation of good pre-training procedures. However, an understanding pertaining to the consistency of parameter estimates, the convergence of learning procedures an...
Pre-training of Recurrent Neural Networks via Linear Autoencoders
We propose a pre-training technique for recurrent neural networks based on linear autoencoder networks for sequences, i.e. linear dynamical systems modelling the target sequences. We start by giving a closed form solution for the definition of the optimal weights of a linear autoencoder given a training set of sequences. This solution, however, is computationally very demanding, so we suggest a...
Language Resources for Multi-Modal Dialogue Systems
This paper reviews a resource base of software agents for hub-based architectures, which can be used generally for advanced dialogue systems research and deployment. The problem of domain-specificity of dialogue managers is discussed, and we describe an approach to it developed at CSLI, involving a domain-general dialogue manager with application specific “Activity Models”. We also describe rel...
Journal
Journal Title: Lecture Notes in Computer Science
Year: 2022
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-16443-9_65